These solutions are only for the private use of the students of DS 6030 Fall 2024. Sharing of the solutions, posting on the internet, selling to a company, or possession by anyone outside of the class constitutes a violation of the honor policy.
Required R packages and Directories
data_dir = 'https://mdporter.github.io/teaching/data' # data directory
library(ks)        # functions for KDE
library(tidyverse) # functions for data manipulation
Problem 1 Geographic Profiling
Geographic profiling, a method developed in criminology, can be used to estimate the home location (roost) of animals based on a collection of sightings. The approach requires an estimate of the distribution of distances the animal will travel from its roost to forage for food.
A sample of \(283\) distances that pipistrelle bats traveled (in meters) from their roost can be found at:
One probability model for the distance these bats will travel is: \[\begin{align*}
f(x; \theta) = \frac{x}{\theta} \exp \left( - \frac{x^2}{2 \theta} \right)
\end{align*}\] where the parameter \(\theta > 0\) controls how far they are willing to travel.
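As a quick sanity check (not part of the required solution), numerical integration confirms that \(f(x; \theta)\) is a proper density for any fixed \(\theta > 0\); the value \(\theta = 4.5\) below is an arbitrary test value, not an estimate from the data:

```r
# Verify the density integrates to 1 over (0, Inf).
# theta = 4.5 is an arbitrary test value.
theta = 4.5
f = function(x) (x / theta) * exp(-x^2 / (2 * theta))
integrate(f, lower = 0, upper = Inf)$value  # approximately 1
```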
a. Derive a closed-form expression for the MLE for \(\theta\) (i.e., show the math).
Solution
This is the Rayleigh distribution, named after Lord Rayleigh.
The log-likelihood is: \[\begin{align*}
\log L(\theta) = \sum_{i=1}^n \log X_i - n \log \theta - \frac{1}{2\theta} \sum_{i=1}^n X_i^2
\end{align*}\] Setting the derivative with respect to \(\theta\) equal to zero, \[\begin{align*}
\frac{\partial \log L(\theta)}{\partial \theta} = -\frac{n}{\theta} + \frac{1}{2\theta^2} \sum_{i=1}^n X_i^2 = 0,
\end{align*}\] and solving for \(\theta\) gives the MLE: \[\begin{align*}
\hat{\theta} = \frac{1}{2n} \sum_{i=1}^n X_i^2
\end{align*}\]
b. Estimate \(\theta\) for the bat data using MLE.
Calculate using the solution to part a, or use computational methods.
Solution
#: Read in data
data = readr::read_csv(file.path(data_dir, 'geo_profile.csv'), col_names = FALSE)
x = data$X1 # convert to vector

#: Calculate MLE
n = length(x)
theta = sum(x^2)/(2*n)
The MLE is \(\hat{\theta} = 4.52\).
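As a cross-check on part a (a sketch using simulated data, so it does not depend on the bat CSV), the numerical maximizer of the log-likelihood should agree with the closed-form \(\hat{\theta} = \sum x_i^2 / (2n)\):

```r
# Simulate Rayleigh(theta) draws via the inverse CDF: x = sqrt(-2*theta*log(1-u))
set.seed(1)
theta_true = 4.5
u = runif(283)
x_sim = sqrt(-2 * theta_true * log(1 - u))

# Negative log-likelihood from the derivation above
negloglik = function(theta)
  -(sum(log(x_sim)) - length(x_sim) * log(theta) - sum(x_sim^2) / (2 * theta))

theta_num = optimize(negloglik, interval = c(0.01, 100), tol = 1e-8)$minimum
theta_closed = sum(x_sim^2) / (2 * length(x_sim))
abs(theta_num - theta_closed)  # near zero
```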
c. Plot the estimated density
Using the MLE value of \(\theta\) from part b, calculate the estimated density at a set of evaluation points between 0 and 8 meters. Plot the estimated density.
The x-axis should be distance and y-axis should be density (pdf).
Solution
#: Function to estimate Rayleigh density
drayleigh <- function(x, theta) (x/theta) * exp(-x^2/(2*theta))

#: Calculate density on grid using MLE parameter value
theta_hat = sum(x^2)/(2*n)
plot_data = tibble(
  eval.pts = seq(0, 8, length = 200),
  density = drayleigh(eval.pts, theta = theta_hat)
)

#: Plot using ggplot2
ggplot(plot_data, aes(eval.pts, density)) +
  geom_line() +                                  # plot the density
  annotate("rug", x = x) +                       # add rug
  labs(x = 'distance (meters)', y = 'density')   # change labels
d. Estimate the density using KDE.
Report the bandwidth you selected and produce a plot of the estimated density.
Solution
There is no single right answer for the bandwidth, especially because you are not told how the density estimate will be used; even a purely visual justification is sufficient.
The ks package gives a few options for 1D bandwidth selection (hpi, hlscv, hucv) and any of these would be sufficient as well. As stated in the ks manual, normal kernels are used and the bandwidth corresponds to the standard deviation of the kernel.
Note: The density estimate should have non-negative support since you can’t observe negative distances. The kde() function has an argument positive=TRUE which applies the necessary edge correction (in this case a log transform); however, there is no bandwidth selector that incorporates this constraint.
library(ks) # uses normal kernel

#: Choose bandwidth with least squares cross-validation
(bw = hlscv(x))
#: Fit the KDE with and without the positive-support correction
dens_kde = kde(x, h = bw)
dens_pos = kde(x, h = bw, positive = TRUE) # edge correction

#: Plot the results using base R
plot(dens_kde, las = 1)
lines(dens_pos$eval.points, dens_pos$estimate, col = "purple") # edge correction
rug(x)
e. Which model do you prefer, the parametric or KDE? Why?
Solution
Here again, there is no right answer. Some advantages of the parametric include:
Theoretical support for Rayleigh distribution.
Simple, one parameter model is easy to explain and doesn’t require much data to fit.
Has the correct non-negative support.
However, given sufficient data and a suitable bandwidth, the KDE method will perform almost as well, even if the data is really from a Rayleigh distribution. One drawback is that the KDE method, unless corrected, can assign density to distances less than 0. The more events that are close to the boundary, the worse this problem will be.
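The boundary problem is easy to demonstrate with base R's density() on any strictly non-negative sample (a sketch with simulated data, not the bat data):

```r
set.seed(1)
x_pos = abs(rnorm(300))      # strictly non-negative sample
d = density(x_pos)           # uncorrected Gaussian KDE

# Probability mass placed at impossible negative distances
mass_below_0 = sum(d$y[d$x < 0]) * diff(d$x)[1]
mass_below_0                 # noticeably greater than 0
```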
The plots below will allow a visual comparison.
library(tidyverse)

#: make data for plotting
hist_data = tibble(x)
dens_data = tibble(
  x_seq = seq(0, 8, length = 200),
  rayleigh = drayleigh(x_seq, theta_hat),
  kde = kde(x, h = bw, eval.points = x_seq)$estimate,
  kde.pos = kde(x, h = bw, eval.points = x_seq, positive = TRUE)$estimate
)
A more rigorous test, in this situation, would be to compare their predictive performance using cross-validation or a hold-out set of events.
#: Monte Carlo cross-validation
# Using negative log density as the loss function
loss <- function(f.x) -sum(log(f.x))
M = 100
hout = 10
Loss = tibble()
set.seed(2021)
for(m in 1:M){
  #: sample data
  ind_eval = sample(n, hout)
  x_eval = x[ind_eval]
  x_fit = x[-ind_eval]

  #: fit models
  theta_hat = sum(x_fit^2)/(2*length(x_fit)) # MLE for Rayleigh distribution
  bw = ks::hlscv(x_fit)                      # bandwidth for KDE model

  #: estimate density on hold-out observations
  f_rayleigh = drayleigh(x_eval, theta_hat)
  f_kde = kde(x_fit, h = bw, eval.points = x_eval)$estimate
  f_kde.pos = kde(x_fit, h = bw, eval.points = x_eval, positive = TRUE)$estimate

  #: evaluate predictions
  Loss = bind_rows(
    Loss,
    tibble(
      rayleigh = loss(f_rayleigh),
      kde = loss(f_kde),
      kde_pos = loss(f_kde.pos),
      iter = m
    )
  )
}

# average loss
Loss %>% summarize(across(-iter, mean))
# A tibble: 1 × 3
rayleigh kde kde_pos
<dbl> <dbl> <dbl>
1 17.0 17.2 17.4
It turns out that the Rayleigh distribution had the lowest loss (i.e., the largest out-of-sample likelihood).
Problem 2: Interstate Crash Density
Interstate 64 (I-64) is a major east-west road that passes just south of Charlottesville. Where and when are the most dangerous places/times to be on I-64? The crash data (link below) gives the mile marker and fractional time-of-week for crashes that occurred on I-64 between mile marker 87 and 136 in 2016. The time-of-week data takes a numeric value of <dow>.<hour/24>, where the dow starts at 0 for Sunday (6 for Sat) and the decimal gives the time of day information. Thus time=0.0417 corresponds to Sun at 1am and time=6.5 corresponds to Sat at noon.
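To make the encoding concrete, here is a small helper (illustrative only; the name decode_tow is made up) that converts the numeric time-of-week back to a day and hour:

```r
decode_tow = function(t) {
  days = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
  dow = floor(t)                 # integer part: day of week, 0 = Sunday
  hour = round((t - dow) * 24)   # fractional part scaled to hours
  paste0(days[dow + 1], " ", hour, ":00")
}
decode_tow(0.0417)  # "Sun 1:00"
decode_tow(6.5)     # "Sat 12:00"
```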
A mile marker bandwidth of 2.92 corresponds to the standard deviation of a Gaussian kernel. Thus, the density at mile marker \(110\) is practically affected only by events within about 3 bandwidths, or 8.75 miles. So only events between mile markers 101.25 and 118.75 have a practical impact on the density estimate at mile marker \(110\).
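The "3 bandwidths" rule follows from the fact that a Gaussian kernel places about 99.7% of its mass within three standard deviations; a quick check:

```r
h = 2.92  # mile-marker bandwidth (sd of the Gaussian kernel)

# Mass of a Gaussian kernel centered at mile 110 within +/- 3 bandwidths
pnorm(110 + 3 * h, mean = 110, sd = h) - pnorm(110 - 3 * h, mean = 110, sd = h)
# approximately 0.997
```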
c. Use KDE to estimate the temporal time-of-week density.
Report the bandwidth.
Plot the density estimate.
Solution
library(ks)

#: Use smoothed cross-validation bandwidth
(bw.time = hscv(X$time))
A temporal (time-of-week) bandwidth of 0.41 corresponds to the standard deviation of a Gaussian kernel. Thus, the density on Tuesday at 6am (time=2.25) is practically affected only by events within about 3 bandwidths, or 1.23 days. So only events occurring between times 1.02 and 3.48 have a practical impact on the estimate there.
Note: The time axis is better represented as circular instead of linear such that events occurring near the boundaries of 0 and 7 should have more influence on each other. If you are interested in this idea check out the term Directional Statistics.
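One simple approximation to a circular estimate (a sketch, not part of the graded solution) is to replicate the data shifted by one period (±7 days) so kernels wrap across the Saturday/Sunday boundary, then rescale the estimate on [0, 7); base R's density() is used here with simulated times to keep the example self-contained:

```r
set.seed(1)
t = runif(500, 0, 7)          # stand-in time-of-week data (not the crash data)
t_wrap = c(t - 7, t, t + 7)   # three copies, one period apart

d = density(t_wrap, bw = 0.4, from = 0, to = 7)
d$y = 3 * d$y                 # each event appears 3 times, so rescale
# d$y now integrates to ~1 over [0, 7)
```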
d. Use KDE to estimate the bivariate mile-time density.
Report the bandwidth parameters.
Plot the bivariate density estimate.
Solution
#: Estimate density. Use smoothed cross-validation
(H = Hscv(X))
The bandwidth provides an understanding of how “local” the density estimate is. We can plot a 95% Gaussian confidence ellipse around point (110, 2.25) to see the kernel shape and size.
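A sketch of how such an ellipse can be computed. The H below is reconstructed from the reported sqrt(diag(H)) values, with a made-up near-zero cross term of 0.1, so it is illustrative only:

```r
H = matrix(c(5.098^2, 0.1, 0.1, 0.605^2), 2, 2)  # illustrative bandwidth matrix
center = c(110, 2.25)

ev = eigen(H)
r = sqrt(qchisq(0.95, df = 2))                    # 95% radius for a bivariate normal
angle = seq(0, 2 * pi, length.out = 100)
circ = rbind(cos(angle), sin(angle)) * r          # circle in whitened coordinates
ell = t(ev$vectors %*% (sqrt(ev$values) * circ))  # map circle to the kernel's shape
ell = sweep(ell, 2, center, "+")                  # shift to the center point
# plot(ell, type = "l", xlab = "mile marker", ylab = "time of week")
```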
Notice that the multivariate kernel bandwidth is estimated to be almost separable (i.e., the off-diagonal term is close to 0), producing an axis-aligned ellipse. Also, the multivariate bandwidths are larger than the univariate equivalents; note that the square roots of the diagonal of \(H\) correspond to the bandwidths of the univariate estimates.
sqrt(diag(H))
[1] 5.098 0.605
e. Crash Hotspot
Based on the estimated density, approximate the most dangerous place and time to drive on this stretch of road. Identify the mile marker and time-of-week pair (within a few miles and hours).
Solution
Here is a colored contour plot with the contour levels set at 5%, 25%, and 50%: